<<<<<<< HEAD =======

Bay Area useR Group - Time Series Workshop

Rami Krispin http://twitter.com/rami_krispin , Danton Noriega http://twitter.com/dantonnoriega
2019-10-05

Table of Contents


Agenda

Today, we will mainly focus on methods for analyzing and forecasting regular time-series data with seasonality patterns

<<<<<<< HEAD

Quick pool

Assumptions

=======

Assumptions

>>>>>>> arima <<<<<<< HEAD

Why R?

=======

Why R?

>>>>>>> arima

Goals

By the end of this workshop, you probably won’t become an expert in time series analysis and forecasting, but you will be able to:

Admin

Workshop material

All today’s slides, code, and rmarkdown files are available on GitHub

Downloading the workshop material from the terminal:


git clone https://github.com/RamiKrispin/Time-Series-Workshop.git

Or lunch it from a docker container:


Introduction to time series analysis

Time series analysis is commonly used in many fields of science, such as economics, finance, physics, engineering, and astronomy. The usage of time series analysis to understand past events and to predict future ones did not start with the introduction of the stochastic process during the past century. Ancient civilizations such as the Greeks, Romans, or Mayans, researched and learned how to utilize cycled events such as weather and astronomy to predict future events.

Time series analysis - is the art of extracting meaningful insights from time-series data to learn about past events and to predict future events.

This process includes the following steps:

Generally, in R this process will look like this:

Of course, there are more great packages that could be part of this process such as zoo, xts, bsts, forecastHybird, prophet, etc.

<<<<<<< HEAD

The TSstudio package

The TSstudio package provides a set of functions for time series analysis. That includes interactive data visualization tools based on the plotly package engine, supporting multiple time series objects such as ts, xts, and zoo. The following diagram demonstrates the workflow of the TSstudio package:

=======

The TSstudio package

The TSstudio package provides a set of functions for time series analysis. That includes interactive data visualization tools based on the plotly package engine, supporting multiple time series objects such as ts, xts, and zoo. In addition, the package provides a set of utility functions for preprocessing time series data, and as well backtesting applications for forecasting models from the forecast, forecastHybrid and bsts packages.

>>>>>>> arima

Time series data

Time series data - is a sequence of values, each associate to a unique point in time that can divide to the following two groups:

  • Regular time series - is a sequence of observations which were captured at equally spaced time intervals (e.g., every month, week, day, hour, etc.)
  • Irregular time series - or unevenly spaced time series, is a sequence of observations which were not captured on equally spaced time intervals (for example rainy days, earthquakes, clinical trials, etc.)

Note: typically, the term time series data referred to regular time-series data. Therefore, if not stated otherwise, throughout the workshop the term time series (or series) refer to regular time-series data

Examples of time series data

<<<<<<< HEAD

Examples of time series data

Applications

=======

Applications

>>>>>>> arima

With time series analysis, you can answer questions such as:

  • How many vehicles, approximately, going to be sold in the US in the next 12 months?
  • What will be the estimated demand for natural gas in the US in the next five years?
  • Generally, what will be the demand for electricity in the UK during the next 24 hours?

Time series objects

There are multiple classes in R for time-series data, the most common types are:

  • The ts class for regular time-series data, and mts class for multiple time seires objects , the most common class for time series data
  • The xts and zoo classes for both regular and irregular time series data, mainly popular in the financial field
  • The tsibble class, a tidy format for time series data, support both regular and irregular time-series data

The attribute of time series object

A typical time series object should have the following attributes:

  • A vector or matrix objects with sequential observations
  • Index or timestamp
  • Frequency units
  • Cycle units

Where the frequency of the series represents the units of the cycle. For example, for monthly series, the frequency units are the month of the year, and the cycle units are the years. Similarly, for daily series, the frequency units could be the day of the year, and the cycle units are also the years.

The stats package provides a set of functions for handling and extracting information from a ts object. The frequency and cycle functions, as their names implay return the frequency and the cycle, respectivly, of the object. Let’s load the USgas series from the TSstudio package and apply those functions:


library(TSstudio)
data(USgas)

class(USgas)

[1] "ts"

is.ts(USgas)

[1] TRUE

frequency(USgas)

[1] 12

cycle(USgas)

     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2000   1   2   3   4   5   6   7   8   9  10  11  12
2001   1   2   3   4   5   6   7   8   9  10  11  12
2002   1   2   3   4   5   6   7   8   9  10  11  12
2003   1   2   3   4   5   6   7   8   9  10  11  12
2004   1   2   3   4   5   6   7   8   9  10  11  12
2005   1   2   3   4   5   6   7   8   9  10  11  12
2006   1   2   3   4   5   6   7   8   9  10  11  12
2007   1   2   3   4   5   6   7   8   9  10  11  12
2008   1   2   3   4   5   6   7   8   9  10  11  12
2009   1   2   3   4   5   6   7   8   9  10  11  12
2010   1   2   3   4   5   6   7   8   9  10  11  12
2011   1   2   3   4   5   6   7   8   9  10  11  12
2012   1   2   3   4   5   6   7   8   9  10  11  12
2013   1   2   3   4   5   6   7   8   9  10  11  12
2014   1   2   3   4   5   6   7   8   9  10  11  12
2015   1   2   3   4   5   6   7   8   9  10  11  12
2016   1   2   3   4   5   6   7   8   9  10  11  12
2017   1   2   3   4   5   6   7   8   9  10  11  12
2018   1   2   3   4   5   6   7   8   9  10  11    

The time function returns the series index or timestamp:


head(time(USgas))

[1] 2000.000 2000.083 2000.167 2000.250 2000.333 2000.417

The deltat function returns the length of series’ time interval (which is equivalent to 1/frequency):


deltat(USgas)

[1] 0.08333333

Similarly, the start and end functions return the starting and ending time of the series, respectively:


start(USgas)

[1] 2000    1

end(USgas)

[1] 2018   11

Where the left number represents the cycle units, and the right side represents the frequency units of the series. The tsp function returns both the start and end of the series and its frequency:


tsp(USgas)

[1] 2000.000 2018.833   12.000

Last but not least, the ts_info function from the TSstudio package returns a concise summary of the series:


ts_info(USgas)

 The USgas series is a ts object with 1 variable and 227 observations
 Frequency: 12 
 Start time: 2000 1 
 End time: 2018 11 

Creating a ts object

The ts function allows to create a ts object from a single vector and a mts object from a multiple vectors (or matrix). By defining the start (or end) and frequency of the series, the function generate the object index. In the following example we will load the US_indicators dataset from the TSstudio package and convert it to a ts object. The US_indicators is a data.frame with the monthly vehicle sales and unemployment rate in the US since 1976:


data(US_indicators)

head(US_indicators)

        Date Vehicle Sales Unemployment Rate
1 1976-01-31         885.2               8.8
2 1976-02-29         994.7               8.7
3 1976-03-31        1243.6               8.1
4 1976-04-30        1191.2               7.4
5 1976-05-31        1203.2               6.8
6 1976-06-30        1254.7               8.0

mts_obj <- ts(data = US_indicators[, c("Vehicle Sales", "Unemployment Rate")], 
              start = c(1976, 1),
              frequency = 12)

ts_info(mts_obj)

 The mts_obj series is a mts object with 2 variables and 517 observations
 Frequency: 12 
 Start time: 1976 1 
 End time: 2019 1 

How to define the start and frequency arguments?

Series Type Cycle Units Frequency Units Frequency Example
Quarterly Years Quarter of the year 4 ts(x, start = c(2019, 2), frequency = 4)
Monthly Years Month of the year 12 ts(x, start = c(2019, 1), frequency = 12)
Weekly Years Week of the year 52 ts(x, start = c(2019, 13), frequency = 52)
Daily Years Day of the year 365 ts(x, start = c(2019, 290), frequency = 365)

What if you have more granular time series data such as half-hour, 15 or five minutes intervals?

Me when needed to work with daily time series using ts object:

The disadvantages of the ts object

The ts object was designed for work with monthly, quarterly, or yearly series that have only two-time components (e.g., year and month). Yet, more granular series (high frequency) may have more than two-time components. A common example is a daily series that has the following time attributes:

  • Year
  • Month
  • Day of the year
  • Day of the week

When going to the hourly or minute levels, this is even adding more components such as the hour, minute, etc.

The zoo, xts classes and now the tsibble class provide solution for this issue.

The tsibble class

“The tsibble package provides a data infrastructure for tidy temporal data with wrangling tools…”

In other words, the tsibble object allows you to work with a data frame alike (i.e., tbl object) with a time awareness attribute. The key characteristics of this class:

  • It has a date/time object as an index
  • Using key to store multiple time series objects
  • A tbl object - can apply any of the normal tools to reformat, clean or modify tbl object such as dplyr functions

The reaction of me and my colegues when the tsibble package was released:

Creating a tsibble object


library(UKgrid)

data(UKgrid)

class(UKgrid)
<<<<<<< HEAD
## [1] "data.frame"
head(UKgrid)
##             TIMESTAMP    ND I014_ND   TSD I014_TSD ENGLAND_WALES_DEMAND
## 1 2011-01-01 00:00:00 34606   34677 35648    35685                31058
## 2 2011-01-01 00:30:00 35092   35142 36089    36142                31460
## 3 2011-01-01 01:00:00 34725   34761 36256    36234                31109
## 4 2011-01-01 01:30:00 33649   33698 35628    35675                30174
## 5 2011-01-01 02:00:00 32644   32698 34752    34805                29253
## 6 2011-01-01 02:30:00 32092   32112 34134    34102                28688
##   EMBEDDED_WIND_GENERATION EMBEDDED_WIND_CAPACITY
## 1                      484                   1730
## 2                      520                   1730
## 3                      520                   1730
## 4                      512                   1730
## 5                      512                   1730
## 6                      464                   1730
##   EMBEDDED_SOLAR_GENERATION EMBEDDED_SOLAR_CAPACITY NON_BM_STOR
## 1                         0                      79           0
## 2                         0                      79           0
## 3                         0                      79           0
## 4                         0                      79           0
## 5                         0                      79           0
## 6                         0                      79           0
##   PUMP_STORAGE_PUMPING I014_PUMP_STORAGE_PUMPING FRENCH_FLOW BRITNED_FLOW
## 1                   60                        67        1939            0
## 2                   16                        20        1939            0
## 3                  549                       558        1989            0
## 4                  998                       997        1991            0
## 5                 1126                      1127        1992            0
## 6                 1061                      1066        1992            0
##   MOYLE_FLOW EAST_WEST_FLOW I014_FRENCH_FLOW I014_BRITNED_FLOW
## 1       -382              0             1922                 0
## 2       -381              0             1922                 0
## 3       -382              0             1974                 0
## 4       -381              0             1975                 0
## 5       -382              0             1975                 0
## 6       -381              0             1975                 0
##   I014_MOYLE_FLOW I014_EAST_WEST_FLOW
## 1            -382                   0
## 2            -381                   0
## 3            -382                   0
## 4            -381                   0
## 5            -382                   0
## 6            -381                   0
library(dplyr)
library(tsibble)
=======

[1] "data.frame"

head(UKgrid)

            TIMESTAMP    ND I014_ND   TSD I014_TSD
1 2011-01-01 00:00:00 34606   34677 35648    35685
2 2011-01-01 00:30:00 35092   35142 36089    36142
3 2011-01-01 01:00:00 34725   34761 36256    36234
4 2011-01-01 01:30:00 33649   33698 35628    35675
5 2011-01-01 02:00:00 32644   32698 34752    34805
6 2011-01-01 02:30:00 32092   32112 34134    34102
  ENGLAND_WALES_DEMAND EMBEDDED_WIND_GENERATION
1                31058                      484
2                31460                      520
3                31109                      520
4                30174                      512
5                29253                      512
6                28688                      464
  EMBEDDED_WIND_CAPACITY EMBEDDED_SOLAR_GENERATION
1                   1730                         0
2                   1730                         0
3                   1730                         0
4                   1730                         0
5                   1730                         0
6                   1730                         0
  EMBEDDED_SOLAR_CAPACITY NON_BM_STOR PUMP_STORAGE_PUMPING
1                      79           0                   60
2                      79           0                   16
3                      79           0                  549
4                      79           0                  998
5                      79           0                 1126
6                      79           0                 1061
  I014_PUMP_STORAGE_PUMPING FRENCH_FLOW BRITNED_FLOW MOYLE_FLOW
1                        67        1939            0       -382
2                        20        1939            0       -381
3                       558        1989            0       -382
4                       997        1991            0       -381
5                      1127        1992            0       -382
6                      1066        1992            0       -381
  EAST_WEST_FLOW I014_FRENCH_FLOW I014_BRITNED_FLOW I014_MOYLE_FLOW
1              0             1922                 0            -382
2              0             1922                 0            -381
3              0             1974                 0            -382
4              0             1975                 0            -381
5              0             1975                 0            -382
6              0             1975                 0            -381
  I014_EAST_WEST_FLOW
1                   0
2                   0
3                   0
4                   0
5                   0
6                   0

library(dplyr)

>>>>>>> arima
uk_grid <- UKgrid %>% 
  dplyr::select(time = TIMESTAMP, 
                net_demand = ND, 
                wind_gen = EMBEDDED_WIND_GENERATION, 
                solar_gen = EMBEDDED_SOLAR_GENERATION) %>%
  as_tsibble(index = time)
  

head(uk_grid)
## # A tsibble: 6 x 4 [30m] <UTC>
##   time                net_demand wind_gen solar_gen
##   <dttm>                   <int>    <int>     <int>
## 1 2011-01-01 00:00:00      34606      484         0
## 2 2011-01-01 00:30:00      35092      520         0
## 3 2011-01-01 01:00:00      34725      520         0
## 4 2011-01-01 01:30:00      33649      512         0
## 5 2011-01-01 02:00:00      32644      512         0
## 6 2011-01-01 02:30:00      32092      464         0
class(uk_grid)
## [1] "tbl_ts"     "tbl_df"     "tbl"        "data.frame"
index(uk_grid)
## time
interval(uk_grid)
## 30m
<<<<<<< HEAD

Descriptive analysis of time series

Like most common fields of statistics and machine learning, the goal of the descriptive analysis is to reveal meaningful insights about the series with the use of descriptive statistics and data visualization tools.

Plotting time series object

The plot function or plot.ts functions are R built-in functions for plotting time series object:

data("USVSales")

plot.ts(USVSales, 
        main = "US Monthly Total Vehicle Sales",
        ylab = "Thousands of units",
        xlab = "Date")

Alternatively, the ts_plot function from the TSstudio package provides an interactive visualization for time series object (ts, xts, zoo, tsibble, ets.). This function using the plotly package plotting engine:

ts_plot(USVSales, 
        title = "US Monthly Total Vehicle Sales",
        Ytitle = "Thousands of units",
        Xtitle = "Date",
        slider = TRUE)

The main advantage of using interactive data visualization tools that it allows you to zoom in the data with a click of a button. This is super useful when working with data and in particular, with time-series data.

The dygraphs package is another great tool for visualization time series data:

library(dygraphs)

dygraph(USVSales, 
        main = "US Monthly Total Vehicle Sales",
        ylab = "Thousands of units",
        xlab = "Date") %>% 
  dyRangeSelector()

The time series components

Time series data, typically, would have two types of patterns:

Structural patterns:

  • Trend - define the general growth of the series and its rate (e.g., linear, exponential, etc.)
  • Cycle - derived from the broad definition of a cycle in macroeconomics. A cycle can be described as a sequence of repeatable events over time, where the starting point of a cycle is at a local minimum of the series and the ending point is at the next one, and the ending point of one cycle is the starting point of the following cycle.
  • Seasonal - define the variation of the series that related to the frequency units of the series

Nonstructural patterns

The irregular component - which include any other patterns that are not captured by the trend, cycle, and seasonal components. For example structural changes, non-seasonal holidays effect, etc.

Together, the structural and non-structural patterns compounding the series, which can be formalized by the following expressions:

=======

Descriptive analysis of time series

Like most common fields of statistics and machine learning, the goal of the descriptive analysis is to reveal meaningful insights about the series and the key components of its structure.

The time series components

>>>>>>> arima
  • \(Y_t = T_t + C_t + S_t + I_t\), when the series has an additive structure, or

  • \(Y_t = T_t \times C_t \times S_t \times I_t\), when the series has a multiplicative structure

<<<<<<< HEAD
ts_decompose(USgas)

Decomposition of time series object

Seasonal analysis

ts_seasonal(USgas, type = "normal")
ts_seasonal(USgas, type = "cycle")
ts_seasonal(USgas, type = "box")
USgas_decompose <- USgas - decompose(USgas)$trend
ts_seasonal(USgas_decompose, type = "all")
ts_heatmap(USgas, color = "Reds")
ts_heatmap(USVSales, color = "Reds")
ts_surface(USgas)
ts_polar(USgas)

Correlation Analysis

UKgrid_daily <- extract_grid(type = "ts", aggregate = "daily")

acf(UKgrid_daily)

pacf(UKgrid_daily)

ts_cor(UKgrid_daily, lag.max = 365 * 2)
ts_cor(UKgrid_daily, lag.max = 365 * 2, seasonal_lags = 7)
ts_lags(USgas)
ts_lags(USgas, lags = c(12, 24, 36))
=======

ts_decompose(USgas)

ARIMA

Sources - ARIMA

Brief Overview

  • White Noise and Random Walks
  • Autoregressive (AR) models
  • Moving average (MA) models
  • Autoregressive moving average (ARMA) models
  • Using ACF & PACF for model ID

Goals

Leave this room with a shallow understanding of

  • Stationarity
  • Differencing
  • ARMA Identification
Model ACF PACF
AR(p) Tails off slowly Cuts off after lag p
MA(q) Cuts off after lag q Tails off slowly
ARMA(p,q) Tails off slowly Tails off slowly
  • ARIMA
  • Seasonal ARIMA
  • Linear Regression with ARIMA errors

Gaussian white noise

We often assume so-called Gaussian white noise, whereby

\[ w_t \sim \text{N}(0,\sigma^2) \]

and the following apply as well

  • autocovariance: \[ \gamma_k = \begin{cases} \sigma^2 & \text{if } k = 0 \\ 0 & \text{if } k \geq 1 \end{cases} \]
  • autocorrelation: \[ \rho_k = \begin{cases} 1 & \text{if } k = 0 \\ 0 & \text{if } k \geq 1 \end{cases} \]

par(mfrow = c(1,2), mai = c(.6,0.6,0.1,0.2), oma = c(0,0,1.5,0), mgp = c(1.7,.6,0))
tt <- rnorm(100)
plot.ts(tt, ylab = expression(italic(w[t])))
acf(tt, main = "")
title(expression(w[t] %~% N(0,1)), outer = TRUE)

>>>>>>> arima

Forecasting with linear regression

Random walk (RW)

A time series \(\{x_t\}\) is a random walk if

  1. \(x_t = x_{t-1} + w_t\)
  2. \(w_t\) is white noise

par(mfrow = c(1,2), mai = c(.6,0.6,0.1,0.2), oma = c(0,0,1.5,0), mgp = c(1.7,.6,0))
tt <- cumsum(rnorm(100))
plot.ts(tt, ylab = expression(italic(x[t])))
acf(tt, main = "")
title(expression(list(x[t] == x[t-1] + w[t], w[t] %~% N(0,1))), outer = T)

Biased random walk

A biased random walk (or random walk with drift) is written as

\[ x_t = x_{t-1} + u + w_t \]

where \(u\) is the bias (drift) per time step and \(w_t\) is white noise


par(mfrow = c(1,2), mai = c(.6,0.6,0.1,0.2), oma = c(0,0,1.5,0), mgp = c(1.7,.6,0))
xx <- ww <- rnorm(100)
uu <- 1
for(t in 2:100) {
  xx[t] <- xx[t-1] + uu + ww[t]
}
plot.ts(xx, ylab = expression(italic(x[t])))
acf(tt, main = "")
title(expression(list(x[t] == x[t-1] + 1 + w[t], w[t] %~% N(0,1))), outer = T)

Differencing a biased random walk

First-differencing a biased random walk yields a constant mean (level) \(u\) plus white noise

\[ \begin{align} \Delta x_t &= x_{t-1} + u + w_t \\ x_t - x_{t-1} &= x_{t-1} + u + w_t - x_{t-1} \\ x_t - x_{t-1} &= u + w_t \end{align} \]


par(mfrow = c(1,2), mai = c(0.6,0.5,0.3,0.1), omi = c(0,0,0,0), 
    mgp = c(1.7, .5, 0), cex.axis = .8, cex.axis = 0.8, tcl = -.3)
plot.ts(diff(xx), ylab = expression(paste(Delta,italic(x[t]))))
acf(tt, main="")
title(expression(list(paste(Delta, x[t]) == 1 + w[t],w[t] %~% N(0,1))), outer = T, line = .7)

Autoregressive (AR) models

Autoregressive models treat a current state of nature as a function its past state(s)

An autoregressive model of order p, or AR(p), is defined as

\[ x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_p x_{t-p} + w_t \]

where we assume

  1. \(w_t\) is white noise
  2. \(\phi_p \neq 0\) for an order-p process

Examples of AR(p) models

  • AR(1)
    • \(x_t = 0.5 x_{t-1} + w_t\)
  • AR(1) with \(\phi_1 = 1\) (random walk)
    • \(x_t = x_{t-1} + w_t\)
  • AR(2)
    • \(x_t = -0.2 x_{t-1} + 0.4 x_{t-2} + w_t\)

set.seed(123)
### the 4 AR coefficients
ARp <- c(0.7, 0.2, -0.1, -0.3)
### empty list for storing models
AR_mods <- vector("list", 4L)

par(mfrow = c(2,2), mai = c(0.6,0.5,0.3,0.1), omi = c(0,0,0,0), 
    mgp = c(1.7, .5, 0), cex.axis = .8, cex.axis = 0.8, tcl = -.3)
### loop over orders of p
for(p in 1:4) {
  ## assume SD=1, so not specified
  AR_mods[[p]] <- arima.sim(n=50, list(ar=ARp[1:p]))
  plot.ts(AR_mods[[p]], las = 1,
          ylab = expression(italic(x[t])))
  mtext(side = 3, paste0("AR(",p,")"),
        line = 0.3, adj = 0, cex = 0.8)
}

Stationary AR(p) models

Recall that stationary processes have the following properties

  1. no systematic change in the mean or variance
  2. no systematic trend
  3. no periodic variations or seasonality

We seek a means for identifying whether our AR(p) models are also stationary

We can write out an AR(p) model using the backshift operator

\[ x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_p x_{t-p} + w_t \\ \Downarrow \\ \begin{align} x_t - \phi_1 x_{t-1} - \phi_2 x_{t-2} - \dots - \phi_p x_{t-p} &= w_t \\ (1 - \phi_1 \mathbf{B} - \phi_2 \mathbf{B}^2 - \dots - \phi_p \mathbf{B}^p) x_t &= w_t \\ \phi_p (\mathbf{B}) x_t &= w_t \\ \end{align} \]

If we treat \(\mathbf{B}\) as a number (or numbers), we can out write the characteristic equation as

\[ \phi_p (\mathbf{B}) x_t = w_t \\ \Downarrow \\ \phi_p (\mathbf{B}) = 0 \]

To be stationary, all roots of the characteristic equation must exceed 1 in absolute value

Example: Consider the AR(1) model from earlier

\[ \begin{align} x_t &= 0.5 x_{t-1} + w_t \\ x_t - 0.5 x_{t-1} &= w_t \\ (1 - 0.5 \mathbf{B})x_t &= w_t \\ \Downarrow \\ 1 - 0.5 \mathbf{B} &= 0 \\ -0.5 \mathbf{B} &= -1 \\ \mathbf{B} &= 2 \\ \end{align} \]

This model is stationary because \(\mathbf{B} > 1\)

What about this AR(2) model from earlier?

\[ \begin{align} x_t &= -0.2 x_{t-1} + 0.4 x_{t-2} + w_t \\ x_t + 0.2 x_{t-1} - 0.4 x_{t-2} &= w_t \\ (1 + 0.2 \mathbf{B} - 0.4 \mathbf{B}^2)x_t &= w_t \\ \Downarrow \\ 1 + 0.2 \mathbf{B} - 0.4 \mathbf{B}^2 &= 0 \\ \Downarrow \\ \mathbf{B} \approx -1.35 ~ \text{and}& ~ \mathbf{B} \approx 1.85 \end{align} \]

This model is not stationary because only one \(\mathbf{B} > 1\)

What about random walks?

Consider our random walk model

\[ \begin{align} x_t &= x_{t-1} + w_t \\ x_t - x_{t-1} &= w_t \\ (1 - 1 \mathbf{B})x_t &= w_t \\ \Downarrow \\ 1 - 1 \mathbf{B} &= 0 \\ -1 \mathbf{B} &= -1 \\ \mathbf{B} &= 1 \\ \end{align} \]

Random walks are not stationary because \(\mathbf{B} = 1 \ngtr 1\)

Coefficients of AR(1) models

Same value, but different sign


set.seed(123)
### list description for AR(1) model with small coef
AR_pos <- list(order=c(1,0,0), ar=0.7, sd=0.1)
### list description for AR(1) model with large coef
AR_neg <- list(order=c(1,0,0), ar=-0.7, sd=0.1)
### simulate AR(1)
AR1_pos <- arima.sim(n=500, model=AR_pos)
AR1_neg <- arima.sim(n=500, model=AR_neg)

### get y-limits for common plots
ylm1 <- c(min(AR1_pos[1:50],AR1_neg[1:50]), max(AR1_pos[1:50],AR1_neg[1:50]))

### set the margins & text size
par(mfrow = c(1,2), mai = c(.6,0.6,0.1,0.2), oma = c(0,0,1.5,0), mgp = c(1.7,.6,0))
### plot the ts
plot.ts(AR1_pos[1:50], ylim=ylm1, las = 1,
        ylab=expression(italic(x)[italic(t)]))
mtext(side = 3, expression(paste(phi[1]," = 0.7")),
      line = 0.4, adj = 0)
plot.ts(AR1_neg[1:50], ylim=ylm1, las = 1,
        ylab=expression(italic(x)[italic(t)]))
mtext(side = 3, expression(paste(phi[1]," = -0.7")),
      line = 0.4, adj = 0)

Both positive, but different magnitude


set.seed(123)
### list description for AR(1) model with small coef
AR_bg <- list(order=c(1,0,0), ar=0.9, sd=0.1)
### list description for AR(1) model with large coef
AR_sm <- list(order=c(1,0,0), ar=0.1, sd=0.1)
### simulate AR(1)
AR1_bg <- arima.sim(n=500, model=AR_bg)
AR1_sm <- arima.sim(n=500, model=AR_sm)

### get y-limits for common plots
ylm2 <- c(min(AR1_bg[1:50],AR1_sm[1:50]), max(AR1_bg[1:50],AR1_sm[1:50]))

### set the margins & text size
par(mfrow = c(1,2), mai = c(.6,0.6,0.1,0.2), oma = c(0,0,1.5,0), mgp = c(1.7,.6,0))
### plot the ts
plot.ts(AR1_bg[1:50], ylim = ylm2, las = 1,
        ylab = expression(italic(x)[italic(t)]),
        main = "")
mtext(side = 3, expression(paste(phi[1]," = 0.9")),
      line = 0.4, adj = 0)
plot.ts(AR1_sm[1:50], ylim = ylm2, las = 1,
        ylab = expression(italic(x)[italic(t)]),
        main = "")
mtext(side = 3, expression(paste(phi[1]," = 0.1")),
      line = 0.4, adj = 0)

ACF & PACF for AR(p) models

Autocorrelation function (ACF)

Recall that the autocorrelation function (\(\rho_k\)) measures the correlation between \(\{x_t\}\) and a shifted version of itself \(\{x_{t+k}\}\)

ACF for AR(1) models

ACF oscillates for model with \(-\phi\)


### set the margins & text size
par(mfrow=c(2,2), mai = c(0.6,0.5,0.3,0.1), omi = c(0,0,0,0), mgp = c(1.7, .5, 0), cex.axis = .8, cex.axis = 0.8, tcl = -.3)
### plot the ts
plot.ts(AR1_pos[1:50], ylim=ylm1, las = 1,
        ylab=expression(italic(x)[italic(t)]),
        main = "")
mtext(side = 3, expression(paste(phi[1]," = 0.7")),
      line = 0.2, adj = 0, cex = 0.8)
acf(AR1_pos, lag.max = 20, las = 1)
plot.ts(AR1_neg[1:50], ylim=ylm1, las = 1,
        ylab=expression(italic(x)[italic(t)]),
        main = "")
mtext(side = 3, expression(paste(phi[1]," = -0.7")),
      line = 0.2, adj = 0, cex = 0.8)
acf(AR1_neg, lag.max = 20, las = 1)

For model with large \(\phi\), ACF has longer tail


### set the margins & text size
par(mfrow=c(2,2), mai = c(0.6,0.5,0.3,0.1), omi = c(0,0,0,0), mgp = c(1.7, .5, 0), cex.axis = .8, cex.axis = 0.8, tcl = -.3)
### plot the ts
plot.ts(AR1_bg[1:50], ylim = ylm2, las = 1,
        ylab = expression(italic(x)[italic(t)]),
        main = "")
mtext(side = 3, expression(paste(phi[1]," = 0.9")),
      line = 0.2, adj = 0, cex = 0.8)
acf(AR1_bg, lag.max = 20, las = 1)
plot.ts(AR1_sm[1:50], ylim = ylm2, las = 1,
        ylab = expression(italic(x)[italic(t)]),
        main = "")
mtext(side = 3, expression(paste(phi[1]," = 0.1")),
      line = 0.2, adj = 0, cex = 0.8)
acf(AR1_sm, lag.max = 20, las = 1)

Partial autocorrelation funcion (PACF)

Recall that the partial autocorrelation function (\(\phi_k\)) measures the correlation between \(\{x_t\}\) and a shifted version of itself \(\{x_{t+k}\}\), with the linear dependence of \(\{x_{t-1},x_{t-2},\dots,x_{t-k-1}\}\) removed


### set 3 AR coefficients
ARp3 <- list(c(0.7, 0.2, -0.1), c(-0.7, 0.2, 0.1))

expr <- list(expression(paste("AR(3) with ", phi[1], " = 0.7, ",
                              phi[2], " = 0.2, ", phi[3], " = -0.1")),
             expression(paste("AR(3) with ", phi[1], " = -0.7, ",
                              phi[2], " = 0.2, ", phi[3], " = 0.1")))

### empty list for storing models
AR3_mods <- vector("list", 2L)

par(mfrow = c(2,3), mai = c(0.5,0.4,0.3,0.1), omi = c(0,0,0,0), mgp = c(1.8, .7, 0), cex.axis = 0.8, tcl = -.3)
### loop over orders of p
for(p in 1:2) {
  ## assume SD=1, so not specified
  AR3_mods[[p]] <- arima.sim(n=5000, list(ar=ARp3[[p]]))
  plot.ts(AR3_mods[[p]][1:50], las = 1,
          ylab = expression(italic(x[t])))
  acf(AR3_mods[[p]], lag.max = 20,
      las = 1, main = "")
  mtext(side = 3, expr[[p]],
        line = 0.2, adj = 0.5, cex = .8)
  pacf(AR3_mods[[p]], lag.max = 20,
       las = 1, main = "")
}

PACF for AR(p) models


### empty list for storing models
pacf_mods <- vector("list", 4L)

par(mfrow = c(2,2), mai = c(0.6,0.5,0.3,0.1), omi = c(0,0,0,0), 
    mgp = c(1.7, .5, 0), cex.axis = .8, cex.axis = 0.8, tcl = -.3)
### loop over orders of p
for(p in 1:4) {
  pacf_mods[[p]] <- arima.sim(n=5000, list(ar=ARp[1:p]))
  pacf(pacf_mods[[p]], lag.max = 15,
       las = 1, main = "")
  mtext(side = 3, paste0("AR(",p,")"),
        line = 0.2, adj = 0)
}

Do you see the link between the order p and lag k?

Using ACF & PACF for model ID (1)

Model ACF PACF
AR(p) Tails off slowly Cuts off after lag p

Moving average (MA) models

Moving average models are most commonly used for forecasting a future state

A moving average model of order q, or MA(q), is defined as

\[ x_t = w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \dots + \theta_q w_{t-q} \]

where \(w_t\) is white noise

Each of the \(x_t\) is a sum of the most recent error terms

A moving average model of order q, or MA(q), is defined as

\[ x_t = w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \dots + \theta_q w_{t-q} \]

where \(w_t\) is white noise

Each of the \(x_t\) is a sum of the most recent error terms

Thus, all MA processes are stationary because they are finite sums of stationary WN processes

Examples of MA(q) models


### the 4 MA coefficients
MAq <- c(0.7, 0.2, -0.1, -0.3)
### empty list for storing models
MA_mods <- vector("list", 4L)
### loop over orders of q
for(q in 1:4) {
  ## assume SD=1, so not specified
  MA_mods[[q]] <- arima.sim(n=500, list(ma=MAq[1:q]))
}

par(mfrow = c(2,2), mai = c(0.6,0.5,0.3,0.1), omi = c(0,0,0,0), 
    mgp = c(1.7, .5, 0), cex.axis = .8, cex.axis = 0.8, tcl = -.3)
### loop over orders of p
for(q in 1:4) {
  ## assume SD=1, so not specified
  plot.ts(MA_mods[[q]][1:50], las = 1,
          ylab = expression(italic(x[t])))
  mtext(side = 3, paste0("MA(",q,")"),
        line = 0.2, adj = 0, cex = 0.8)
}

Compare MA(1) & MA(2) with similar structure


MA1 <- arima.sim(n=50, list(ma=c(0.7)))
MA2 <- arima.sim(n=50, list(ma=c(-1, 0.7)))

par(mfrow = c(1,2), mai = c(0.6,0.5,0.3,0.1), omi = c(0,0,0,0), 
    mgp = c(1.7, .5, 0), cex.axis = .8, cex.axis = 0.8, tcl = -.3)
### loop over orders of p
plot.ts(MA1, las = 1,
        ylab = expression(italic(x[t])))
mtext(side = 3,
      expression(MA(1):~italic(x[t])==~italic(w[t])+0.7~italic(w[t-1])),
      line = 0.2, adj = 0, cex = 0.8)
plot.ts(MA2, las = 1,
        ylab = expression(italic(x[t])))
mtext(side = 3,
      expression(MA(2):~italic(x[t])==~italic(w[t])-~italic(w[t-1])+0.7~italic(w[t-2])),
      line = 0.2, adj = 0, cex = 0.8)

AR(p) model as an MA(\(\infty\)) model

It is possible to write an AR(p) model as an MA(\(\infty\)) model

For example, consider an AR(1) model

\[ \begin{align} x_t &= \phi x_{t-1} + w_t \\ x_t &= \phi (\phi x_{t-2} + w_{t-1}) + w_t \\ x_t &= \phi^2 x_{t-2} + \phi w_{t-1} + w_t \\ x_t &= \phi^3 x_{t-3} + \phi^2 w_{t-2} + \phi w_{t-1} + w_t \\ & \Downarrow \\ x_t &= w_t + \phi w_{t-1}+ \phi^2 w_{t-2} + \dots + \phi^k w_{t-k} + \phi^{k+1} x_{t-k-1} \end{align} \]

If our AR(1) model is stationary, then

\[ \lvert \phi \rvert < 1 ~ \Rightarrow ~ \lim_{k \to \infty} \phi^{k+1} = 0 \]

so

\[ \begin{align} x_t &= w_t + \phi w_{t-1}+ \phi^2 w_{t-2} + \dots + \phi^k w_{t-k} + \phi^{k+1} x_{t-k-1} \\ & \Downarrow \\ x_t &= w_t + \phi w_{t-1}+ \phi^2 w_{t-2} + \dots + \phi^k w_{t-k} \end{align} \] \[ \]

Invertible MA(q) models

An MA(q) process is invertible if it can be written as a stationary autoregressive process of infinite order without an error term

For example, consider an MA(1) model

\[ \begin{align} x_t &= w_t + \theta w_{t-1} \\ & \Downarrow \\ w_t &= x_t - \theta w_{t-1} \\ w_t &= x_t - \theta (x_{t-1} - \theta w_{t-2}) \\ w_t &= x_t - \theta x_{t-1} - \theta^2 w_{t-2} \\ & ~~\vdots \\ w_t &= x_t - \theta x_{t-1} + \dots + (-\theta)^k x_{t-k} + (-\theta)^{k+1} w_{t-k-1} \\ \end{align} \]

If we constrain \(\lvert \theta \rvert < 1\), then

\[ \lim_{k \to \infty} (-\theta)^{k+1} w_{t-k-1} = 0 \]

and

\[ \begin{align} w_t &= x_t - \theta x_{t-1} + \dots + (-\theta)^k x_{t-k} + (-\theta)^{k+1} w_{t-k-1} \\ & \Downarrow \\ w_t &= x_t - \theta x_{t-1} + \dots + (-\theta)^k x_{t-k} \\ w_t &= x_t + \sum_{k=1}^\infty(-\theta)^k x_{t-k} \end{align} \]

Q: Why do we care if an MA(q) model is invertible?

A: It helps us identify the model’s parameters

Example, these MA(1) models are equivalent

  • \(x_t = w_t + \frac{1}{5} w_{t-1}, ~\text{with} ~w_t \sim ~\text{N}(0,25)\)
  • \(x_t = w_t + 5 w_{t-1}, ~\text{with} ~w_t \sim ~\text{N}(0,1)\)

But we select the first model because it is invertable (the coefficient is less than 1).

ACF & PACF for MA(q) models


### set 3 AR coefficients
set.seed(123)
MAp3 <- list(c(0.7), c(-0.7, 0.5, 0.2))

expr <- list(expression(paste("MA(1) with ", theta[1], " = 0.7, ")),
             expression(paste("MA(3) with ", theta[1], " = -0.7, ",
                              theta[2], " = 0.5, ", theta[3], " = 0.2")))

### empty list for storing models
MA3_mods <- vector("list", 2L)

par(mfrow = c(2,3), mai = c(0.6,0.5,0.3,0.1), omi = c(0,0,0,0), mgp = c(1.7, .5, 0), cex.axis = .8, cex.axis = 0.8, tcl = -.3)
### loop over orders of p
for(p in 1:2) {
  ## assume SD=1, so not specified
  MA3_mods[[p]] <- arima.sim(n=500, model = list(ma=MAp3[[p]]))
  plot.ts(MA3_mods[[p]][1:50], las = 1,
          ylab = expression(italic(x[t])))
  acf(MA3_mods[[p]], lag.max = 10,
      las = 1, main = "")
  mtext(side = 3, expr[[p]],
        line = 0.2, adj = 0.5, cex = .8)
  pacf(MA3_mods[[p]], lag.max = 10,
       las = 1, main = "")
}

ACF for MA(q) models


### the 4 MA coefficients
set.seed(123)
MAq <- c(0.7, 0.5, -0.2, -0.3)

### empty list for storing models
acf_mods <- vector("list", 4L)

par(mfrow = c(2,2), mai = c(0.6,0.5,0.3,0.1), omi = c(0,0,0,0), 
    mgp = c(1.7, .5, 0), cex.axis = .8, cex.axis = 0.8, tcl = -.3)
### loop over orders of p
for(q in 1:4) {
  acf_mods[[p]] <- arima.sim(n=5000, list(ma=MAq[1:q]))
  acf(acf_mods[[p]], lag.max = 15,
       las = 1, main = "")
  mtext(side = 3, paste0("MA(",q,")"),
        line = 0.2, adj = 0, cex = .8)
}

Do you see the link between the order q and lag k?

Using ACF & PACF for model ID (2)

Model ACF PACF
AR(p) Tails off slowly Cuts off after lag p
MA(q) Cuts off after lag q Tails off slowly

Autoregressive moving average models (ARMA)

An autoregressive moving average, or ARMA(p,q), model is written as

\[ x_t = \phi_1 x_{t-1} + \dots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \dots + \theta_q w_{t-q} \]

We can write an ARMA(p,q) model using the backshift operator

\[ \phi_p (\mathbf{B}) x_t= \theta_q (\mathbf{B}) w_t \]

ARMA models are stationary if all roots of \(\phi_p (\mathbf{B}) > 1\)

ARMA models are invertible if all roots of \(\theta_q (\mathbf{B}) > 1\)

Examples of ARMA(p,q) models


set.seed(123)
arma_mods <- vector("list", 4L)

### ARMA(3,1): phi[1] = 0.7, phi[2] = 0.2, phi[3] = -0.1, theta[1]= 0.5
arma_mods[[1]] <- arima.sim(list(ar=c(0.7, 0.2, -0.1), ma=c(0.5)), n=5000)
### ARMA(2,2): phi[1] = -0.7, phi[2] = 0.2, theta[1] = 0.7, theta[2]= 0.2
arma_mods[[2]] <- arima.sim(list(ar=c(-0.7, 0.2), ma=c(0.7, 0.2)), n=5000)
### ARMA(1,3): phi[1] = 0.7, theta[1] = 0.7, theta[2]= 0.2, theta[3] = 0.5
arma_mods[[3]] <- arima.sim(list(ar=c(0.7), ma=c(0.7, 0.2, 0.5)), n=5000)
### ARMA(2,2): phi[1] = 0.7, phi[2] = 0.2, theta[1] = 0.7, theta[2]= 0.2
arma_mods[[4]] <- arima.sim(list(ar=c(0.7, 0.2), ma=c(0.7, 0.2)), n=5000)

titles <- list(
  expression("ARMA(3,1): "*phi[1]*" = 0.7, 
             "*phi[2]*" = 0.2, "*phi[3]*" = -0.1, "*theta[1]*" = 0.5"),
  expression("ARMA(2,2): "*phi[1]*" = -0.7, 
             "*phi[2]*" = 0.2, "*theta[1]*" = 0.7, "*theta[2]*" = 0.2"),
  expression("ARMA(1,3): "*phi[1]*" = 0.7, 
             "*theta[1]*" = 0.7, "*theta[2]*" = 0.2, "*theta[3]*" = 0.5"),
  expression("ARMA(2,2): "*phi[1]*" = 0.7, 
             "*phi[2]*" = 0.2, "*theta[1]*" = 0.7, "*theta[2]*" = 0.2")
)

par(mfrow = c(2,2), mai = c(0.6,0.5,0.3,0.1), omi = c(0,0,0,0), 
    mgp = c(1.7, .5, 0), cex.axis = .8, cex.axis = 0.8, tcl = -.3)
### loop over orders of p
for(i in 1:4) {
  plot.ts(arma_mods[[i]][1:50], las = 1,
          main = "", ylab = expression(italic(x[t])))
  mtext(side = 3, titles[[i]],
        line = 0.2, adj = 0, cex = 0.7)
  
}

ACF for ARMA(p,q) models


par(mfrow = c(2,2), mai = c(0.6,0.5,0.3,0.1), omi = c(0,0,0,0), 
    mgp = c(1.7, .5, 0), cex.axis = .8, cex.axis = 0.8, tcl = -.3)
### loop over orders of p
for(i in 1:4) {
  acf(arma_mods[[i]][1:1000], las = 1,
          main = "")
  mtext(side = 3, titles[[i]],
        line = 0.2, adj = 0, cex = 0.8)
  
}

PACF for ARMA(p,q) models


par(mfrow = c(2,2), mai = c(0.6,0.5,0.3,0.1), omi = c(0,0,0,0), 
    mgp = c(1.7, .5, 0), cex.axis = .8, cex.axis = 0.8, tcl = -.3)
### loop over orders of p
for(i in 1:4) {
  pacf(arma_mods[[i]][1:1000], las = 1,
          main = "")
  mtext(side = 3, titles[[i]],
        line = 0.2, adj = 0, cex = 0.8)
  
}

Using ACF & PACF for model ID (3)

Model ACF PACF
AR(p) Tails off slowly Cuts off after lag p
MA(q) Cuts off after lag q Tails off slowly
ARMA(p,q) Tails off slowly Tails off slowly

Autoregressive integrated moving average (ARIMA) models

If the data do not appear stationary, differencing can help

This leads to the class of autoregressive integrated moving average (ARIMA) models

ARIMA models are indexed with orders (p,d,q) where d indicates the order of differencing

ARIMA(p,d,q) models

For \(d > 0\), \(\{x_t\}\) is an ARIMA(p,d,q) process if \((1-\mathbf{B})^d x_t\) is an ARMA(p,q) process

For example, if \(\{x_t\}\) is an ARIMA(1,1,0) process then \(\nabla \{x_t\}\) is an ARMA(1,0) = AR(1) process


set.seed(123)
par(mfrow = c(2,3), mai = c(0.6,0.5,0.3,0.1), omi = c(0,0,0,0), 
    mgp = c(1.6, .5, 0), cex.axis = .8, cex.axis = 0.8, tcl = -.3)

xx <- arima.sim(model=list(ar=0.5, sd=0.1), n=100)
yy <- cumsum(xx)

plot.ts(yy, las = 1,
        ylab=expression(italic(x[t])))
mtext(side = 3, "ARIMA(1,1,0)", line = 0.2, adj = 0, cex = 0.8)
acf(yy, main = "")
pacf(yy, main = "")

plot.ts(diff(yy), las = 1,
        ylab=expression(paste(symbol("\xd1"), italic(x[t]))))
mtext(side = 3, "ARMA(1,0)", line = 0.2, adj = 0, cex = 0.8)
acf(diff(yy), main = "")
pacf(diff(yy), main = "")